Goto

Collaborating Authors

 data scientist


Oblivious Sampling Algorithms for Private Data Analysis

Neural Information Processing Systems

Trusted execution environments (TEEs) canbeused to protect the content of the data during query computation, while supporting differential-private (DP) queries in TEEs provides record privacy when query output isrevealed.



Pipeline Combinators for Gradual AutoML

Neural Information Processing Systems

Automated machine learning (AutoML) can make data scientists more productive. But if machine learning is totally automated, that leaves no room for data scientists to apply their intuition. Hence, data scientists often prefer not total but gradual automation, where they control certain choices and AutoML explores the rest. Unfortunately, gradual AutoML is cumbersome with state-of-the-art tools, requiring large non-compositional code changes. More concise compositional code can be achieved with combinators, a powerful concept from functional programming. This paper introduces a small set of orthogonal combinators for composing machine-learning operators into pipelines. It describes a translation scheme from pipelines and associated hyperparameter schemas to search spaces for AutoML optimizers. On that foundation, this paper presents Lale, an open-source sklearn-compatible AutoML library, and evaluates it with a user study.


PCS Workflow for Veridical Data Science in the Age of AI

arXiv.org Artificial Intelligence

Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to display the PCS framework in action, and conduct a related case study which showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.


Risks and Opportunities in Human-Machine Teaming in Operationalizing Machine Learning Target Variables

arXiv.org Artificial Intelligence

Predictive modeling has the potential to enhance human decision-making. However, many predictive models fail in practice due to problematic problem formulation in cases where the prediction target is an abstract concept or construct and practitioners need to define an appropriate target variable as a proxy to operationalize the construct of interest. The choice of an appropriate proxy target variable is rarely self-evident in practice, requiring both domain knowledge and iterative data modeling. This process is inherently collaborative, involving both domain experts and data scientists. In this work, we explore how human-machine teaming can support this process by accelerating iterations while preserving human judgment. We study the impact of two human-machine teaming strategies on proxy construction: 1) relevance-first: humans leading the process by selecting relevant proxies, and 2) performance-first: machines leading the process by recommending proxies based on predictive performance. Based on a controlled user study of a proxy construction task (N = 20), we show that the performance-first strategy facilitated faster iterations and decision-making, but also biased users towards well-performing proxies that are misaligned with the application goal. Our study highlights the opportunities and risks of human-machine teaming in operationalizing machine learning target variables, yielding insights for future research to explore the opportunities and mitigate the risks.



Oblivious Sampling Algorithms for Private Data Analysis

Neural Information Processing Systems

Trusted execution environments (TEEs) can be used to protect the content of the data during query computation, while supporting differential-private (DP) queries in TEEs provides record privacy when query output is revealed.



Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment

arXiv.org Artificial Intelligence

Adopting Large language models (LLMs) in organizations potentially revolutionizes our lives and work. However, they can generate off-topic, discriminating, or harmful content. This AI alignment problem often stems from misspecifications during the LLM adoption, unnoticed by the principal due to the LLM's black-box nature. While various research disciplines investigated AI alignment, they neither address the information asymmetries between organizational adopters and black-box LLM agents nor consider organizational AI adoption processes. Therefore, we propose LLM ATLAS (LLM Agency Theory-Led Alignment Strategy) a conceptual framework grounded in agency (contract) theory, to mitigate alignment problems during organizational LLM adoption. We conduct a conceptual literature analysis using the organizational LLM adoption phases and the agency theory as concepts. Our approach results in (1) providing an extended literature analysis process specific to AI alignment methods during organizational LLM adoption and (2) providing a first LLM alignment problem-solution space.


Don't buy a PCIe 5.0 SSD unless you say 'Yes' to these 3 questions

PCWorld

There's a common misconception about PCIe 5.0 SSDs that since they're the latest-generation SSD storage and boast faster speed and bandwidth compared to PCIe 4.0 SSD, anyone who gets one is going to experience a big uptick in PC performance. That's certainly not the case as we've shown by analyzing things like gaming performance. But there are a few exceptions to that rule. In fact, if any of the below statements are true for you, you may well have a justifiable reason for splurging out on a costly PCIe 5.0 SSD upgrade. Scientists and other professionals work with very large amounts of data -- often terabytes but sometimes even petabytes in size.